In this tutorial, I'll be talking about style transfer with TensorFlow.
The algorithm was proposed by Gatys et al. in the paper "A Neural Algorithm of Artistic Style".
Any help, suggestions, or criticism will be very welcome! :D
Note:
You'll always have to generate a new graph if you want to change the images.
It isn't the best way to implement this, but I think it'll be easier to understand.
Style transfer is an incredible technique that uses convolutional neural networks (CNNs) to mix the style of one image with the content of another.
When we talk about content, we're referring to the lines that outline the image, its 'sketch'.
The style would be the texture of the image.
Source: A Neural Algorithm of Artistic Style
The concept of style transfer is easy to understand. In the paper "A Neural Algorithm of Artistic Style", we have the following loss function that we should minimize:
$$ L_{total}(\vec{p}, \vec{a}, \vec{x})\ =\ \alpha \times L_{content}(\vec{p}, \vec{x})\ +\ \beta \times L_{style}(\vec{a}, \vec{x}) $$
At first sight it looks scary, but it's actually easy to understand.
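To make the structure concrete, here's a minimal sketch of how the two terms combine. The names content_loss and style_loss are just placeholders here; the real TensorFlow implementation is built step by step below.
# a minimal sketch of the total loss structure; content_loss and style_loss
# are hypothetical placeholders for the terms we'll build below
def total_loss(p, a, x, alpha, beta):
    return alpha * content_loss(p, x) + beta * style_loss(a, x)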
Now, let's import some dependencies and develop the loss function above in order to apply our style transfer.
# The VGG19 model that we'll be using
import vgg
# functions to handle images
from helper import load_image, plot_image
# to handle vectors/matrices
import numpy as np
# The TensorFlow library
import tensorflow as tf
# an optimizer
from tensorflow.contrib.opt import ScipyOptimizerInterface
# miscellaneous utilities
import sys
import time
from os.path import isfile
The following code just checks whether you're running on a GPU.
If you're using a CPU, generating the image may take a very long time.
if not tf.test.is_gpu_available():
    print('No GPU available.', file=sys.stderr)
else:
    print('Default GPU: {}'.format(tf.test.gpu_device_name()))
Before we apply the style transfer, let's talk about the VGG network.
Our VGG model is already pretrained and is available on the internet.
The VGG model is the one with 19 layers. Each layer contains a collection of filters, where each filter extracts some image feature.
The model we'll be using was designed to receive rank-4 tensors (batch, height, width, depth) as input, and the images need some preprocessing before being fed through the network.
Our images are rank 3, so we'll need to add another dimension and preprocess them.
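To illustrate, here's a minimal NumPy sketch of that shape change, together with a mean subtraction, which is the usual VGG preprocessing; whether vgg.preprocess does exactly this is an assumption here, and the values are just for demonstration.
# ImageNet per-channel RGB means, commonly used to preprocess VGG inputs
rgb_mean_toy = np.array([123.68, 116.779, 103.939])
img_toy = np.zeros((336, 600, 3), dtype=np.float32)  # rank 3: (height, width, depth)
batch_toy = img_toy[None, :, :, :]                   # rank 4: (1, height, width, depth)
batch_toy = batch_toy - rgb_mean_toy                 # mean subtraction, broadcast over channels
print(batch_toy.shape)                               # (1, 336, 600, 3)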
# Checking if the pretrained model exists
assert isfile('imagenet-vgg-verydeep-19.mat'), 'The pretrained model should be in the project root'
# loading both the pretrained weights and the RGB channel means
# the means are used for preprocessing the images
vgg_weights, vgg_mean = vgg.load_net('imagenet-vgg-verydeep-19.mat')
As we said before, the content and style images will be constants.
Let's load our images.
The content image will be my cat Mavis taking his incredible nap.
content_image = load_image('images/mavis.jpg', shape=(600, 336))
style_image = load_image('images/style.jpg', shape=(content_image.shape[1], content_image.shape[0]))
# just adjusting the image for plotting
img = plot_image(content_image)
img = img.rotate(270)
img.crop((132, 0, 468, 336))
The style image will be the Eternal Winter III by Aenea-Jones that I found on DeviantArt.
plot_image(style_image)
# adding another dimension on the images
content_image = content_image[None,:,:,:]
style_image = style_image[None,:,:,:]
# preprocessing them
content_image = vgg.preprocess(content_image, vgg_mean)
style_image = vgg.preprocess(style_image, vgg_mean)
print('Content Image Shape: {}'.format(content_image.shape))
print('Style Image Shape: {}'.format(style_image.shape))
$\alpha$ and $\beta$ are weighting factors for the content and style reconstruction.
In the paper, the authors used $\frac{\alpha}{\beta}$ ratios of 1e-3 and 1e-4.
Here we'll be using a ratio of 1e-3.
The content layer will be conv4_2 (we can add more layers).
And the style layers will be relu1_1, relu2_1, relu3_1, relu4_1, and relu5_1.
The weighting factors determine how much each layer contributes to the loss function.
# maximum number of iterations
max_iter = 1000
# weighting factors
alpha = 1e-3
beta = 1.
# content layers and their weights
content_layers = ['conv4_2']
content_weights = [1.]
# style layers and their weights
style_layers = ['relu1_1', 'relu2_1', 'relu3_1', 'relu4_1', 'relu5_1']
style_weights = np.ones([5]) * .2
The content loss function is given by $$L_{content}(\vec{p}, \vec{x}, l)\ =\ \frac{1}{2}\ \sum_{i,j} \big(F_{ij}^l - P_{ij}^l\big)^2$$
Here are some considerations: $F^l$ is the feature representation of the mixture image $\vec{x}$ at layer $l$, and $P^l$ is the feature representation of the content image $\vec{p}$ at layer $l$.
So, let's build the content loss function!
# the graph that we'll be working on
graph = tf.Graph()
with graph.as_default() as g:
    # as said before, p will be a constant
    p = tf.constant(content_image, tf.float32, name='p')
    # here we're feedforwarding the content image
    with g.name_scope('content'):
        vgg_content = vgg.net_preloaded(vgg_weights, p, 'max')
    # this will be the image that we're generating, so it'll be a variable
    # this is the variable that we're minimizing over
    x = tf.get_variable('x', p.get_shape(), dtype=tf.float32, initializer=tf.random_uniform_initializer())
    with g.name_scope('mixture'):
        vgg_mixture = vgg.net_preloaded(vgg_weights, x, 'max')
    # content loss
    content_loss = 0
    for layer, weight in zip(content_layers, content_weights):
        P = vgg_content[layer]  # getting the content feature representation
        F = vgg_mixture[layer]  # getting the mixture feature representation
        content_loss += weight * .5 * tf.reduce_sum(tf.square(F - P))
The formula is translated in the last code line:
(...) .5 * tf.reduce_sum(tf.square(F-P))
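As a quick sanity check, here's the same computation on a pair of hypothetical 2x2 feature maps in plain NumPy (toy values, just for illustration):
F_toy = np.array([[1., 2.], [3., 4.]])  # mixture features (made up)
P_toy = np.array([[1., 1.], [1., 1.]])  # content features (made up)
# 0.5 * (0^2 + 1^2 + 2^2 + 3^2) = 7.0
print(.5 * np.sum(np.square(F_toy - P_toy)))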
Before we talk about the style loss, let's talk about the Gram matrix.
The matrix is given by $$ G_{i,j}^l = \sum_k F_{ik}^lF_{jk}^l $$
The Gram matrix is the inner product between the vectorised feature maps: it's just the product of a matrix with its transpose.
This matrix computes the correlations between the different filter responses.
def gram_matrix(tensor):
    shape = tensor.get_shape()
    num_channels = int(shape[3])
    matrix = tf.reshape(tensor, [-1, num_channels])
    gram = tf.matmul(tf.transpose(matrix), matrix)
    return gram
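To see what gram_matrix computes, here's the same operation on a toy tensor in plain NumPy (shapes and values invented for illustration):
# a toy (batch, height, width, depth) tensor with 4 filters
feature_toy = np.arange(24, dtype=np.float32).reshape(1, 2, 3, 4)
flat = feature_toy.reshape(-1, 4)  # vectorised feature maps, one column per filter
print(flat.T @ flat)               # (4, 4) matrix of filter correlations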
The style loss function is given by $$ L_{style}(\vec{a}, \vec{x})\ = \sum_{l=0}^L\ w_l \times \frac{1}{4 \times N_l^2 \times M_l^2} \sum_{i,j} \big(G_{ij}^l - A_{ij}^l\big)^2 $$ where $G^l$ and $A^l$ are the Gram matrices of the mixture and style images at layer $l$, $N_l$ is the number of filters, $M_l$ is the size (height times width) of each feature map, and $w_l$ is the weight of each layer's contribution, defined earlier as:
style_weights = np.ones([5]) * .2
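And as with the content term, here's a toy check of a single style term, with invented 2x2 Gram matrices (N = 2 filters, M = 2 spatial positions):
N_toy, M_toy = 2, 2
G_toy = np.array([[5., 2.], [2., 3.]])  # Gram matrix of the mixture (made up)
A_toy = np.array([[4., 2.], [2., 1.]])  # Gram matrix of the style (made up)
# (1 / (4 * 2^2 * 2^2)) * (1^2 + 0 + 0 + 2^2) = 5/64 = 0.078125
print((1. / (4. * N_toy**2 * M_toy**2)) * np.sum(np.square(G_toy - A_toy)))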
Let's build the style loss function!
with graph.as_default() as g:
    # the style image will be defined as a constant too
    a = tf.constant(style_image, tf.float32, name='a')
    # and feedforwarding the style image
    with g.name_scope('style'):
        vgg_style = vgg.net_preloaded(vgg_weights, a, 'max')
    # style loss
    style_loss = 0
    for layer, weight in zip(style_layers, style_weights):
        _, h, w, d = vgg_mixture[layer].get_shape()
        M = h.value * w.value  # size of the feature maps
        N = d.value  # number of filters
        A = gram_matrix(vgg_style[layer])  # getting the style feature representation
        G = gram_matrix(vgg_mixture[layer])  # getting the mixture feature representation
        style_loss += weight * tf.reduce_sum(tf.square(G - A)) * ((1.) / (4. * N**2 * M**2))
And again, our formula is translated in the last code line.
As said before, our goal is to minimize the function
$$ L_{total}(\vec{p}, \vec{a}, \vec{x})\ =\ \alpha \times L_{content}(\vec{p}, \vec{x})\ +\ \beta \times L_{style}(\vec{a}, \vec{x}) $$
Here, we'll be jointly minimizing the distance of the mixture image from the content representation and the style representation in the layers of the CNN.
with graph.as_default() as g:
    # total loss
    total_loss = alpha * content_loss + beta * style_loss
    # optimizer
    optimizer = ScipyOptimizerInterface(total_loss, method='L-BFGS-B', options={'maxiter': max_iter})
Now we can run our style transfer graph.
I ran the graph on my GeForce 930M 4GB GPU; 1000 iterations took approximately 22 minutes.
If you want to see results right away, 100 iterations should be enough.
However, the results are better with 500-2000 iterations.
Before we run it, let's build a function to give us some feedback while it runs.
call_iter = 0
results = {}
def call_back(x):
    global call_iter
    global results
    if call_iter % 10 == 0:
        print('\riter: {}'.format(call_iter), end='')
    if call_iter % 10 == 0 and call_iter <= 100:
        results[call_iter] = x
    call_iter += 1
start = time.time()
with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    optimizer.minimize(sess, fetches=[x], loss_callback=call_back)
    output = sess.run(x)
print('Elapsed time: {}'.format(time.time() - start))
Before we plot the final result, let's undo the preprocessing.
final = vgg.unprocess(output, vgg_mean)
img = plot_image(np.clip(final, 0, 255)[0])
# result after 1000 iterations
img = img.rotate(270).crop((132, 0, 468, 336))
img
We saved an intermediate image every 10 iterations during the first 100 iterations.
Let's see how the image developed along those 100 iterations...
from PIL import Image
keys = sorted(results.keys())[1:]  # skipping iteration 0, keeping 10 snapshots
new_im = Image.new('RGB', (img.size[1] * 5, img.size[0] * 2))
i = 0
for col in range(2):
    for row in range(5):
        key = keys[i]
        image = vgg.unprocess(results[key], vgg_mean)
        image = plot_image(np.clip(image, 0, 255)[0])
        image = image.rotate(270)
        image = image.crop((132, 0, 468, 336))
        new_im.paste(image, (row * image.size[1], col * image.size[0]))
        i += 1
new_im